French patent FR3074594A1: AUTOMATIC EXTRACTION OF ATTRIBUTES OF AN OBJECT WITHIN A SET OF DIGITAL IMAGES

Abstract:
The invention relates to a method for recognizing objects of a predefined type among a set of types, within a set of digital images, comprising: detecting (11) an object of this predefined type within a digital image (10) of said set, and determining (12) an area (13) of said image encompassing the detected object; generating (14) a signature (15) by a convolutional neural network from this area, allowing unambiguous identification of the object; determining (16) a set of attributes (17) from the signature; storing (18) in a database (19) a record relating to said object associating the signature with the set of attributes; wherein the neural network is trained on a training set composed of a first set of objects associated with a set of attributes and a second set of objects not associated with a set of attributes.
Publication number: FR3074594A1
Application number: FR1761632
Filing date: 2017-12-05
Publication date: 2019-06-07
Inventors: Matthieu OSPICI; Antoine CECCHI; Pierre PALEO
Applicant: Bull SA

Description:

The invention relates to the automatic extraction of attributes and the search for objects in a video or a set of digital images. It applies in particular to the monitoring of humans in a video stream for the purpose of video surveillance, marketing targeting, compilation of statistics, etc.
BACKGROUND OF THE INVENTION
Video cameras are increasingly deployed in public or private space to gather information on the behavior of human beings in a geographic area.
A typical application is video surveillance, in order to detect information of interest as early as possible (for example suspicious behavior) in areas at risk or to be monitored (airport, train station, bank, etc.) or in any other space intended to receive the public, including the street.
Other applications may relate to commercial or marketing purposes in order to monitor and characterize the behavior of potential customers in a commercial area (shop, shopping center, supermarket, etc.).
Video streams form very large amounts of information, to the point that manual monitoring becomes difficult, costly or even impossible in certain situations.
Some research tools have been developed in recent years, but they are mainly based on facial analysis, and not on the complete appearance of a person in low resolution images.
Existing solutions therefore do not offer sufficient performance.
In addition, they aim to follow humans from one image to another in a stream of successive images. They do not make it possible to search a previously collected bank of images to determine whether a given person appears in it. A fortiori, they do not make it possible to determine whether people corresponding to a given set of attributes appear in such a bank of images.
SUMMARY OF THE INVENTION
The object of the present invention is to provide a solution which at least partially overcomes the aforementioned drawbacks.
To this end, the present invention provides a method for recognizing objects of a set of predefined types within a set of digital images, comprising
- detecting an object of said predefined type within a digital image of said set, and determining an area of said image encompassing said detected object;
- the generation of a signature by a convolutional neural network from said area, allowing unequivocal identification of said object;
- the determination from said signature of a set of attributes;
- the storage in a database of a record relating to said object associating said signature to said set of attributes;
in which said neural network is trained on a training set composed of a first set formed of objects associated with a set of attributes and a second set formed of objects not associated with a set of attributes.
According to preferred embodiments, the invention comprises one or more of the following characteristics which can be used separately or in partial combination with one another or in total combination with one another:
- said predefined type is a human;
- said digital images form a video stream;
- said convolutional neural network includes a ResNet50 network;
- said training set is composed of a plurality of subsets, each determined in a different operational context;
- the training of said convolutional neural network is supervised by a mechanism involving a "center loss";
- the method further comprises a step of searching for a set of objects within said database from a set of values associated with attributes;
- the method further comprises a step of searching for a set of objects within said database, from an image containing a single object to be searched;
- said record associates said identifier with a set of values representative of the values of said attributes for a succession of digital images of said set.
The invention also relates to a computer program comprising program instructions for the execution of a method as defined above, when said program is executed on a computer.
Thus, the invention makes it possible to clearly improve the quality of the recognition and detection of objects by specifying their characteristics.
It also makes it possible to deal with additional fields of application compared to the solutions of the state of the art.
In particular, thanks to the association between detected persons and attributes, the solution of the invention makes it possible to target video surveillance within a company or on a construction site, for example, in order to detect people not respecting certain safety rules (wearing a vest, helmet, etc.).
The invention also makes it possible to search for and extract attributes of objects other than human beings, such as in particular animals in the case of monitoring or calculating statistics on an animal park or on a breeding operation.
Other characteristics and advantages of the invention will appear on reading the following description of a preferred embodiment of the invention, given by way of example and with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 schematically represents an example of a flowchart illustrating an embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
The invention relates to the tracking of objects of a predefined type within a set of digital images.
The objects are typically humans, but other types of objects can be considered without making major modifications to the principles of the invention: animals, vehicles, etc.
In a first step, referenced 11, in FIG. 1, one (or more) objects of a predefined type (human, animal, vehicle, etc.) are detected within a digital image 10. The next step 12 consists in determining an encompassing zone for each of the detected objects. It is also possible to detect objects belonging to several types (for example detecting humans and vehicles, in the same process).
The digital image 10 can be an image coming directly from a video camera or from an image bank. It can represent a scene in which none, one or more objects can be present.
In addition, these objects can be represented in different ways on the digital image: in particular, in the case of a human, this can be seen from the front, from the back, in profile, standing, sitting, in a lighted area or in the dark, etc.
Determining an area encompassing the detected object(s) consists in delimiting an area of a predetermined shape (typically a rectangle) that contains the detected object in its entirety and has minimum, or substantially minimum, dimensions (margins may be provided between the extremities of the object and the limits of the encompassing area).
In this way, the invention makes it possible to consider all kinds of “raw” images, such as those produced by a camera, to detect independently the presence of an object and, if necessary, to then track this object, or these objects when the scene captured by the camera includes several.
The steps of detecting objects (11) and determining the encompassing areas (12) can be carried out in different ways. Depending on the embodiment, these two steps can be carried out simultaneously, by the same mechanism.
Preferably, a convolutional neural network is used to simultaneously and progressively shrink an area around an object and classify the object contained in the area into one of the classes sought (which may simply be the presence or absence of an object of a predetermined type).
The result is a set of encompassing zones (i.e. zones reduced to an optimal surface), each with a degree of confidence that it contains an object of the type sought.
An example of such an approach is described in Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu and Alexander C. Berg, "SSD: Single Shot MultiBox Detector", European Conference on Computer Vision (ECCV), Springer International Publishing (2016), pp. 21-37.
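As an illustration of steps 11 and 12, the sketch below filters the output of a detector by type and confidence, then adds a margin around each tight rectangle. It is only a sketch under stated assumptions: the `Detection` structure, the function names, the confidence threshold and the margin value are hypothetical and do not come from the patent, and the detector itself (for example an SSD-style network) is assumed to have already produced the list of detections.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detector output: a type, a confidence and a tight rectangle."""
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def expand_box(box, margin, width, height):
    """Add a margin around a tight box, clipped to the image bounds."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))


def select_zones(detections, wanted_types, min_confidence, margin, width, height):
    """Keep confident detections of the wanted types and return their zones."""
    return [
        (d.label, expand_box(d.box, margin, width, height))
        for d in detections
        if d.label in wanted_types and d.confidence >= min_confidence
    ]


# Made-up detections for a 640x480 image.
detections = [
    Detection("human", 0.92, (10, 20, 60, 180)),
    Detection("vehicle", 0.88, (200, 90, 400, 220)),
    Detection("human", 0.30, (300, 10, 330, 90)),  # low confidence, dropped
]
zones = select_zones(detections, {"human"}, 0.5, 5, 640, 480)
print(zones)
```

Passing `{"human", "vehicle"}` as `wanted_types` would correspond to detecting several types in the same pass, as the description allows.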
The encompassing zones, 13, thus determined can then be used as inputs of the step of generation 14 of a signature by a convolutional neural network.
This signature 15 makes it possible to determine a set of attributes characterizing the detected object, and also to identify the detected object. In other words, the same neural network, via a single prediction phase, makes it possible to determine information of two different kinds:
- a signature, which makes it possible to determine whether an object is the same from one image to another;
- a set of attributes that characterize this object.
This way of proceeding allows performance compatible with real time, since a single processing pass is sufficient to determine both kinds of information. Above all, the determination is more precise because both kinds of information are used simultaneously during the learning (or training) phase of the convolutional neural network.
For the neural network to learn well, it is important to train it with a large number of examples.
There are several works which have made it possible to constitute sets of examples from actual captures from video cameras, to which attributes have been manually assigned to the objects represented.
The following page presents a number of such sets of examples, or "datasets":
http://www.liangzheng.org/Project/project_reid.html
According to an embodiment of the invention, the training set of the neural network is a composition of a plurality of subsets, each determined in a different operational context, that is to say established by a different team of researchers, following its own methodology and in its own environment (location of the camera, target population, etc.).
It is important that the number of subsets be large, so that the training set is more representative of the different images that will then be presented to the neural network in the prediction phase, and therefore so that the network generalizes better.
As an example, the subsets used can be:
- CUHK01, as defined in the article "Human Reidentification with Transferred Metric Learning", by Li Wei, Zhao Rui and Wang Xiaogang, ACCV, 2012;
- CUHK03, as defined in the article "DeepReID: Deep Filter Pairing Neural Network for Person Re-identification", by Li Wei, Zhao Rui, Xiao Tong and Wang Xiaogang, CVPR, 2014;
- MARS, as defined in the article "MARS: A Video Benchmark for Large-Scale Person Re-identification", by Zheng Liang, Bie Zhi, Sun Yifan, Wang Jingdong, Su Chi, Wang Shengjin and Tian Qi, in European Conference on Computer Vision (ECCV), 2016;
- VIPeR, as defined in the article "Viewpoint Invariant Pedestrian Recognition with an Ensemble of Localized Features", by D. Gray and H. Tao, in European Conference on Computer Vision (ECCV), 2008;
- Market-1501, as defined in the article "Improving Person Re-identification by Attribute and Identity Learning", by Lin Yutian, Zheng Liang, Zhang Zhedong, Wu Yu and Yang Yi, arXiv:1703.07220, 2017.
Each of these “datasets” includes data impacted by the conditions of acquisition and creation, although they all aim at a certain universality. By aggregating and combining these subsets, we minimize the biases implicitly brought about by the studies that helped to develop them.
In addition, according to the invention, the neural network is trained on a training set composed of a first set formed of objects associated with a set of attributes and a second set formed of objects not associated with a set of attributes.
Thus, any type of training subset can be taken into account: those whose objects have been manually assigned attributes go into the first set, and the others into the second set.
This way of proceeding, known as multi-task learning, is intrinsically linked to the fact that the neural network is trained to produce a “mixed” signature that makes it possible, as seen previously, both to identify the detected object and to determine a set of attributes characterizing it.
Therefore, the first set makes the signature able to represent attributes in addition to being an identifier that is independent of the orientation of the object, while the second set trains the neural network to create a signature representative of an object (identifier), without impact on the general training of the network, in particular regarding its ability to extract attributes.
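One common way to train on such a mixed set is to compute an identity loss for every example and to add an attribute loss only for examples that carry attribute labels. The sketch below illustrates this idea. It is an assumption made for illustration: the patent does not specify the loss functions, so the cross-entropy terms, the `attribute_weight` parameter and the function names are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Identity loss: negative log-probability of the correct identity."""
    return -math.log(probs[target_index])


def binary_cross_entropy(probs, targets):
    """Attribute loss: mean binary cross-entropy over the attributes."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(probs, targets)
    ) / len(targets)


def multitask_loss(identity_probs, identity, attribute_probs, attributes,
                   attribute_weight=1.0):
    """Identity term for every example; attribute term only when labels exist."""
    loss = cross_entropy(identity_probs, identity)
    if attributes is not None:  # example from the first (annotated) set
        loss += attribute_weight * binary_cross_entropy(attribute_probs, attributes)
    return loss


# Same network outputs, once with attribute labels and once without.
identity_probs = [0.7, 0.2, 0.1]
attribute_probs = [0.9, 0.2]
labeled = multitask_loss(identity_probs, 0, attribute_probs, [1, 0])
unlabeled = multitask_loss(identity_probs, 0, attribute_probs, None)
print(labeled, unlabeled)
```

With this masking, examples from the second set still shape the identifying part of the signature, while contributing no gradient to the attribute outputs.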
The neural network is typically a convolutional neural network (or “CNN”), which is a well-known technique in the field of re-identification of an object in a succession of images, as indicated, in particular, in the previously cited article "Improving Person Re-identification by Attribute and Identity Learning", by Lin Yutian, Zheng Liang, Zhang Zhedong, Wu Yu and Yang Yi.
According to one embodiment of the invention, the convolutional neural network can be based on a “ResNet50” network, to which a few layers can be added (layers corresponding to the signature and to the attributes, and layers corresponding to the classification, during training).
This "ResNet50" network was described in particular in the article "Deep Residual Learning for Image Recognition", by K. He, X. Zhang, S. Ren and J. Sun, in CVPR, 2016.
In other words, the architecture of the convolutional neural network used to implement the invention can be known per se. The innovation lies in particular in the training mechanisms of this neural network, and notably in the fact that the training set is composed of a first set formed of objects associated with a set of attributes and a second set formed of objects not associated with a set of attributes.
Consequently, the neural network makes it possible to generate a signature from which the attributes will also be extracted.
According to one embodiment of the invention, moreover, the training of said convolutional neural network is supervised by a mechanism involving a "center loss". Such a mechanism was proposed, in the context of facial recognition, in the article “A Discriminative Feature Learning Approach for Deep Face Recognition” by Yandong Wen, Kaipeng Zhang, Zhifeng Li and Yu Qiao, in European Conference on Computer Vision (ECCV), 2016.
This mechanism aims to provide a signal which, during learning, will encourage the grouping of recognized characteristics belonging to the same class. It thus makes it possible to increase the discriminating character of the classes of learned characteristics.
It has been demonstrated experimentally that it is important, in the context of the invention, to indeed obtain well-discriminated classes at the output of the neural network.
More concretely, the "center loss" learning function tends to collect a signature obtained by presenting a digital image at the input of the neural network, around a "center" of the corresponding class. This center (that is to say a vector similar to a signature, corresponding to the output layer of the neural network) is thus the subject of learning at the same time as the neural network itself.
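The center-loss mechanism can be sketched as follows, using the formulation of the cited article by Wen et al.: the loss is half the sum of squared distances between each signature and the center of its class, and each center is moved toward the signatures of its class at every batch. The function names, the 2-value signatures and the rate `alpha` are illustrative assumptions; in practice this term is added, with a weighting factor, to a classification loss.

```python
def center_loss(signatures, labels, centers):
    """Half the sum of squared distances between signatures and class centers."""
    total = 0.0
    for x, y in zip(signatures, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total


def update_centers(signatures, labels, centers, alpha=0.5):
    """Move each class center toward the batch signatures of its class."""
    new_centers = {}
    for j, c in centers.items():
        members = [x for x, y in zip(signatures, labels) if y == j]
        # delta_j = sum over class members of (c_j - x_i), divided by (1 + count)
        delta = [
            sum(c[k] - x[k] for x in members) / (1 + len(members))
            for k in range(len(c))
        ]
        new_centers[j] = [c[k] - alpha * delta[k] for k in range(len(c))]
    return new_centers


# Two classes, one signature each, centers initialized at the origin.
signatures = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
centers = {0: [0.0, 0.0], 1: [0.0, 0.0]}

before = center_loss(signatures, labels, centers)
centers = update_centers(signatures, labels, centers)
after = center_loss(signatures, labels, centers)
print(before, after)
```

In this toy run only the centers move; during real training the network weights are updated as well, so that signatures of the same class are pulled toward their center.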
During the operation phase of the neural network, we can present it with a new digital image. The generalization capabilities of the neural network then allow the generation of a signature.
This signature corresponds to internal characteristics learned by the neural network (“deep features”).
The determination 16 of a set of attributes 17 from the signature 15 can be implemented by a classification layer.
The signature can be seen as a vector of values. According to a concrete embodiment implemented by the inventors, the signature is a vector of 4096 values.
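A classification layer operating on the signature can be pictured as one linear unit per attribute followed by a squashing function. The sketch below is a minimal illustration under assumptions: the patent only mentions "a classification layer", so the sigmoid, the attribute names, the weights and the 4-value signature (instead of 4096 values) are all hypothetical.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def attribute_probabilities(signature, weights, biases):
    """One linear unit plus sigmoid per attribute, applied to the signature."""
    return {
        name: sigmoid(
            sum(w * s for w, s in zip(weights[name], signature)) + biases[name]
        )
        for name in weights
    }


# Toy signature and made-up parameters for two attributes.
signature = [0.5, -1.0, 2.0, 0.0]
weights = {
    "long_hair": [1.0, 0.0, 1.0, 0.0],
    "red_upper_clothing": [0.0, 1.0, 0.0, 0.0],
}
biases = {"long_hair": 0.0, "red_upper_clothing": 0.0}

probs = attribute_probabilities(signature, weights, biases)
print(probs)
```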
This signature 15 makes it possible to identify an object unequivocally.
If two distinct digital images represent the same object, the neural network must be able to generate two similar signatures. Conversely, it must generate two distinct signatures for digital images of two different objects. According to a concrete embodiment, similarity is measured by calculating a distance between the signatures, one example of such a distance being the cosine distance.
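The comparison of two signatures with a cosine-based distance can be sketched as below. The distance is taken here as one minus the cosine similarity, which is a common convention and an assumption with respect to the patent text; the decision threshold `max_distance` is hypothetical and would have to be tuned on real data.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def same_object(a, b, max_distance=0.3):
    """Decide that two signatures denote the same object if they are close."""
    return cosine_distance(a, b) <= max_distance


# Toy 2-value signatures; real signatures would be much longer.
print(same_object([1.0, 0.1], [0.9, 0.2]))
print(same_object([1.0, 0.0], [0.0, 1.0]))
```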
From this signature 15 are extracted a set of attributes 17.
The attributes can depend on the type of object that one seeks to identify. In the case of humans, attributes may include: gender, colors of upper-body clothing, colors of lower-body clothing, hair length, etc.
More particularly, the values of the signature make it possible to establish a probability for each attribute. Different thresholding mechanisms can be implemented, depending on the use cases desired.
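A per-attribute thresholding of these probabilities might look like the sketch below. The attribute names, the threshold values and the default of 0.5 are illustrative assumptions; a stricter threshold on one attribute trades missed detections for fewer false positives on that attribute.

```python
def apply_thresholds(probabilities, thresholds, default=0.5):
    """Turn attribute probabilities into yes/no decisions, per attribute."""
    return {
        name: p >= thresholds.get(name, default)
        for name, p in probabilities.items()
    }


# Made-up probabilities for a safety-equipment use case.
probabilities = {"wears_helmet": 0.62, "wears_vest": 0.55}
decisions = apply_thresholds(probabilities, {"wears_helmet": 0.8})
print(decisions)
```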
In a step 18, a record relating to the detected object is stored in a database 19; this record associates the object's signature with its attribute values. Other information may also be stored in this record, such as a timestamp or technical data (identifier of the video camera, etc.).
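The storage step can be sketched with a small relational table. This is only one possible layout, chosen for illustration: the patent does not name a database engine or schema, so the use of SQLite, the table and column names, and the JSON encoding of the signature and attributes are all assumptions.

```python
import json
import sqlite3


def create_store(path=":memory:"):
    """Open a database and create the (hypothetical) records table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id INTEGER PRIMARY KEY, signature TEXT NOT NULL, "
        "attributes TEXT NOT NULL, timestamp TEXT, camera_id TEXT)"
    )
    return conn


def store_record(conn, signature, attributes, timestamp=None, camera_id=None):
    """Store one record associating a signature with its attribute values."""
    cur = conn.execute(
        "INSERT INTO records (signature, attributes, timestamp, camera_id) "
        "VALUES (?, ?, ?, ?)",
        (json.dumps(signature), json.dumps(attributes), timestamp, camera_id),
    )
    conn.commit()
    return cur.lastrowid


conn = create_store()
record_id = store_record(
    conn, [0.1, 0.9], {"gender": "female"}, "2017-12-05T10:00:00", "cam-1"
)
row = conn.execute(
    "SELECT signature, attributes, camera_id FROM records WHERE id = ?",
    (record_id,),
).fetchone()
print(json.loads(row[0]), json.loads(row[1]), row[2])
```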
According to an embodiment of the invention, rather than storing attribute and signature values for each digital image, a value representative of the attributes and of the signature is calculated over a succession of digital images. This succession can be defined by a sliding time window over a video sequence. The representative value can, for example, be the average of the attribute values (or probabilities) and of the signatures. The results are thus consolidated, since the influence of possible outliers is reduced.
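The sliding-window consolidation can be sketched as a running average over the last few images of a tracked object. The class name and the window length of 3 images are illustrative assumptions; the same helper can be applied to attribute probabilities or to signature vectors.

```python
from collections import deque


class SlidingAverage:
    """Average of the last `window` vectors seen for one tracked object."""

    def __init__(self, window):
        self.values = deque(maxlen=window)  # oldest vectors drop out automatically

    def add(self, vector):
        """Add the vector of a new image and return the current average."""
        self.values.append(list(vector))
        n = len(self.values)
        return [sum(column) / n for column in zip(*self.values)]


# Attribute probabilities over four successive images; the third is an outlier.
average = SlidingAverage(window=3)
results = [
    average.add(v)
    for v in ([0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.9, 0.1])
]
print(results[-1])
```

A plain mean only dampens an outlier; a median or a trimmed mean over the window would be alternatives if outliers need to be discarded more strictly.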
According to embodiments of the invention, the results stored in the database 19 can be exploited by carrying out searches there.
These searches can be triggered manually, when a user wishes to search for objects that have been identified or that match a description, etc.
They can also be triggered automatically, in order to determine different statistical results, comparisons between different sequences of images, etc.
Searches can be performed on the basis of a set of values associated with attributes: it is then sufficient to determine the set of records whose attribute values correspond to the search criteria on attributes. It is thus possible to search for all human beings having clothes of a given color, etc.
Searches can also be performed from an image containing a single object to be searched. This image can then be the subject of a processing similar to that described above in order to determine a signature, via the convolutional neural network. This signature can then be used as a search criterion in the database 19.
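Both kinds of search can be sketched over a list of records held in memory. The record layout, the attribute names and the function names are hypothetical; the query signature is assumed to have been produced beforehand by the same convolutional neural network, and ranking uses a cosine-based distance as in the similarity measure mentioned earlier.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search_by_attributes(records, criteria):
    """Return the records whose attribute values match every criterion."""
    return [
        r for r in records
        if all(r["attributes"].get(k) == v for k, v in criteria.items())
    ]


def search_by_signature(records, query_signature, max_results=5):
    """Return the records whose signatures are closest to the query signature."""
    ranked = sorted(
        records, key=lambda r: cosine_distance(r["signature"], query_signature)
    )
    return ranked[:max_results]


records = [
    {"id": 1, "signature": [1.0, 0.0],
     "attributes": {"upper_color": "red", "gender": "female"}},
    {"id": 2, "signature": [0.0, 1.0],
     "attributes": {"upper_color": "blue", "gender": "male"}},
    {"id": 3, "signature": [0.9, 0.1],
     "attributes": {"upper_color": "red", "gender": "male"}},
]

red_ids = [r["id"] for r in search_by_attributes(records, {"upper_color": "red"})]
nearest_ids = [r["id"] for r in search_by_signature(records, [1.0, 0.05], 2)]
print(red_ids, nearest_ids)
```

A linear scan is enough for a sketch; a large database would more plausibly rely on an index suited to nearest-neighbor search over vectors.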
Of course, the present invention is not limited to the examples and to the embodiment described and shown, but it is susceptible of numerous variants accessible to those skilled in the art.

Claims:
Claims (10)
1. Method for recognizing objects of a set of predefined types within a set of digital images, comprising
- detecting (11) an object of a predefined type among said set of types, within a digital image (10) of said set of images, and determining (12) an area (13) of said image encompassing said detected object;
- the generation (14) of a signature (15) by a convolutional neural network from said area, allowing unequivocal identification of said object;
- the determination (16) from said signature of a set of attributes (17);
- the storage (18) in a database (19) of a record relating to said object associating said signature with said set of attributes;
in which said neural network is trained on a training set composed of a first set formed of objects associated with a set of attributes and a second set formed of objects not associated with a set of attributes.
2. Method according to the preceding claim, wherein said predefined type comprises a human.
3. Method according to one of the preceding claims, in which said digital images form a video stream.
4. Method according to one of the preceding claims, wherein said convolutional neural network comprises a ResNet50 network.
5. Method according to one of the preceding claims, wherein said training set is composed of a plurality of subsets, each determined in a different operational context.
6. Method according to one of the preceding claims, in which the training of said convolutional neural network is supervised by a mechanism involving a "center loss".
7. Method according to one of the preceding claims, further comprising a step of searching for a set of objects within said database from a set of values associated with attributes.
8. Method according to one of the preceding claims, further comprising a step of searching for a set of objects within said database, from an image containing a single object to be searched.
9. Method according to one of the preceding claims, in which said record associates said identifier with a set of values representative of the values of said attributes for a succession of digital images of said set.
10. Computer program comprising program instructions for the execution of a method according to one of the preceding claims, when said program is executed on a computer.

Patent family:

Publication number | Publication date

FR3074594B1|2021-01-29|

EP3496000A1|2019-06-12|

US11256945B2|2022-02-22|

US20190171899A1|2019-06-06|

WO2019110914A1|2019-06-13|

Cited documents:

Publication number | Filing date | Publication date | Applicant | Title

US9111147B2|2011-11-14|2015-08-18|Massachusetts Institute Of Technology|Assisted video surveillance of persons-of-interest|

JP2018005555A|2016-07-01|2018-01-11|ソニー株式会社|Image processing device, information processing device and method, as well as program|

US10291949B2|2016-10-26|2019-05-14|Orcam Technologies Ltd.|Wearable device and methods for identifying a verbal contract|

US10366595B2|2017-03-10|2019-07-30|Turing Video, Inc.|Surveillance method and system based on human behavior recognition|

CN110659599A|2019-09-19|2020-01-07|安徽七天教育科技有限公司|Scanning test paper-based offline handwriting authentication system and using method thereof|

EP3800577A1|2019-10-01|2021-04-07|Sensormatic Electronics, LLC|Classification and re-identification using a neural network|

FR3103601A1|2019-11-25|2021-05-28|Idemia Identity & Security France|Method of classifying a biometric fingerprint represented by an input image|

Legal status:
2019-06-07| PLSC| Publication of the preliminary search report|Effective date: 20190607 |

2019-12-23| PLFP| Fee payment|Year of fee payment: 3 |

2020-12-29| PLFP| Fee payment|Year of fee payment: 4 |

2021-12-15| PLFP| Fee payment|Year of fee payment: 5 |

Priority:

Application number | Filing date | Title

FR1761632A|FR3074594B1|2017-12-05|2017-12-05|AUTOMATIC EXTRACTION OF ATTRIBUTES FROM AN OBJECT WITHIN A SET OF DIGITAL IMAGES|

FR1761632|2017-12-05|FR1761632A| FR3074594B1|2017-12-05|2017-12-05|AUTOMATIC EXTRACTION OF ATTRIBUTES FROM AN OBJECT WITHIN A SET OF DIGITAL IMAGES|

PCT/FR2018/053098| WO2019110914A1|2017-12-05|2018-12-04|Automatic extraction of attributes of an object within a set of digital images|

EP18306609.1A| EP3496000A1|2017-12-05|2018-12-04|Automatic extraction of attributes of an object within a set of digital images|

US16/210,693| US11256945B2|2017-12-05|2018-12-05|Automatic extraction of attributes of an object within a set of digital images|
